Decoding lip language using triboelectric sensors with deep learning

Lip language is an effective method of voice-off communication in daily life for people with vocal cord lesions and laryngeal and lingual injuries without occupying the hands. Collection and interpretation of lip language is challenging. Here, we propose the concept of a novel lip-language decoding system with self-powered, low-cost, contact and flexible triboelectric sensors and a well-trained dilated recurrent neural network model based on prototype learning. The structural principle and electrical properties of the flexible sensors are measured and analysed. Lip motions for selected vowels, words, phrases, silent speech and voice speech are collected and compared. The prototype learning model reaches a test accuracy of 94.5% in training 20 classes with 100 samples each. The applications, such as identity recognition to unlock a gate, directional control of a toy car and lip-motion to speech conversion, work well and demonstrate great feasibility and potential. Our work presents a promising way to help people lacking a voice live a convenient life with barrier-free communication and boost their happiness, enriches the diversity of lip-language translation systems and will have potential value in many applications.


Supplementary note 1: the simulation results
Supplementary Fig. 1 The spatial distribution of the electric field at different stages (Simulated by …)

Supplementary note 2: comparison of sensor types
Piezoresistive, piezoelectric, capacitive and triboelectric sensors are all electromechanical sensors that transduce an applied force into electrical signals via different mechanisms. They can be divided into two groups: group A includes the passive piezoresistive and capacitive sensors, and group B includes the active piezoelectric and triboelectric sensors.
The resistance or capacitance of the sensors in group A changes when they are pressed, so the applied compressive force can be obtained from the change of resistance or capacitance. To measure this change, an additional voltage is needed, and the change of resistance or capacitance is read out through the change of current. Without this additional voltage, the electrical signals generated by pressure or contact cannot be measured.
Piezoelectric and triboelectric sensors in group B are both self-powered sensors. Piezoelectric sensors exploit the piezoelectric effect, in which charge accumulates on both sides of the material during compression and deformation, creating a voltage. Triboelectric sensors exploit two effects: contact electrification and electrostatic induction. Friction between two interfaces with different surface energies generates charges, which in turn induce charges on the adjacent electrodes. A change in the distance between the two interfaces changes the amount of induced charge and produces an instantaneous voltage and current in the external circuit. Because the electrical signals generated by pressure or contact can be measured without an additional voltage, these sensors are called self-powered.
Compared with group A, the self-powered characteristic of group B implies low power consumption, which is of great importance for small-scale wearable electronics (sensors and devices) and low-carbon living. With the booming of the Internet of Things (IoT), numerous lightweight and wearable sensors have been developed for biomedical monitoring. For a given battery capacity, low energy consumption prolongs the working time of the device and reduces the charging frequency. In addition, group B does not require additional circuit design or a power supply to generate electrical signals, whereas group A does.
Piezoelectric generators (PGs) and triboelectric nanogenerators (TENGs) are the two most common energy-harvesting approaches in group B. The two generators have been compared (Ahmed et al.1, 2020) at frequencies below 4 Hz, which are typical of human motions. The TENG shows higher power performance and is almost independent of the operating frequency, making it highly efficient compared with the PG. Low cost is another advantage of the TENG: various thin-film materials that are common in daily life, e.g. paper, can be used to fabricate TENGs, which greatly lowers the cost.

Supplementary note 3: Sensitivity
The capacitive (ESPB-01, RENHE Co. LTD), piezoresistive (DF9-16, CHENGTec Co. LTD), piezoelectric (LDT0-028K, piezoelectric polyvinylidene fluoride, TE) and our triboelectric sensors are collected to make the comparison, as shown in Supplementary Fig. 3. The sensitivities of these sensors are measured and analyzed, as shown in Supplementary Fig. 4. The sensitivity, trigger point and cost of each sensor are compared in Supplementary Tab. 2. Our triboelectric sensor has the lowest cost and the highest sensitivity among the four sensors.
Supplementary Fig. 3 The sensors for testing; from left to right: capacitive, piezoresistive, piezoelectric and triboelectric sensors.
Supplementary Fig. 4 The measured sensitivities of the four sensors

Supplementary note 4: The normalized waveform of main vowels
Supplementary Fig. 5 The normalized lip-motion voltage waveforms of 12 vowels (a-l) and related mouth shapes

Supplementary note 5: the characteristics of lip motion signals
Supplementary Fig. 6 The characteristics of lip-motion signals. (a) Signals corresponding to the same pronunciation with different mouth-opening sizes. (b) The combined and decomposed lip-motion signals of "Open", "Sesame" and "Open sesame".

Supplementary note 6: Voltage data manipulation
Taking the normalized curves in Fig. 6(b) as an example, the obtained signals are filtered with a cut-off frequency of 20 Hz to remove power-frequency electromagnetic interference; the signals are then intercepted and the baseline is subtracted to reduce the baseline drift caused by ultra-low-frequency noise. The baseline is determined by connecting the first point of the intercepted signal to the last point. To reduce the difference in voltage amplitude between different lip-motion recording processes, the obtained signals are normalized: the absolute value of each point is scaled into the interval [0, 1].
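A minimal sketch of this preprocessing chain is given below, assuming the raw recording is a one-dimensional NumPy array sampled at 500 Hz; the use of a fourth-order Butterworth low-pass filter and the SciPy-based implementation are our own choices, since the text only specifies the 20 Hz cut-off frequency.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 500  # sampling rate (Hz) used during data collection

def preprocess(raw, fs=FS, cutoff=20.0):
    """Filter, subtract the first-to-last-point baseline, and normalize to [0, 1]."""
    # 1) Filter with a 20 Hz cut-off to suppress power-frequency interference.
    b, a = butter(4, cutoff / (fs / 2), btype="low")
    filtered = filtfilt(b, a, raw)

    # 2) Subtract the baseline: the straight line connecting the first and
    #    last points of the intercepted signal.
    baseline = np.linspace(filtered[0], filtered[-1], len(filtered))
    detrended = filtered - baseline

    # 3) Normalize the absolute value of each point into the interval [0, 1].
    abs_signal = np.abs(detrended)
    return abs_signal / (abs_signal.max() + 1e-12)
```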

Supplementary note 7: setup of the neural network classifier
In our method, the model learns one prototype per class in the deep feature space. In the classification stage, classification is performed by calculating the Euclidean distance between the feature representation of a sample and each class prototype. Specifically, the feature mapping function is defined as $f_{\phi}: \mathbb{R}^{D} \rightarrow \mathbb{R}^{d}$, where $\phi$ denotes the parameters of the feature extractor, $D$ is the dimension of the input space, and $d$ is the dimension of the deep feature space. The corresponding prototype for each category is $p_{i}$, $i \in \{1, \ldots, C\}$, where $C$ is the number of training classes. The probability of a sample $(x, y)$ belonging to class $i$ is

$$p(y = i \mid x) = \frac{\exp(-d_{i})}{\sum_{j=1}^{C} \exp(-d_{j})}, \qquad d_{i} = \lVert f_{\phi}(x) - p_{i} \rVert_{2}, \tag{1}$$

where $d_{i}$ is the distance in feature space between the sample $(x, y)$ and the prototype of class $i$. Based on this probability, the cross-entropy loss is $\mathcal{L} = -\log p(y \mid x)$, and the model is trained with an initial learning rate of 0.001.
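The distance-based probability of Eq. (1) and the cross-entropy loss can be sketched as follows. This is an illustrative PyTorch implementation, not the authors' code: the feature extractor is passed in as a generic module (the paper uses a dilated recurrent neural network), the prototypes are treated as learnable parameters, and the choice of the Adam optimizer is our assumption since the text only states the 0.001 initial learning rate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    """Prototype learning: one prototype per class in the deep feature space."""

    def __init__(self, feature_extractor, num_classes, feature_dim):
        super().__init__()
        self.feature_extractor = feature_extractor               # f_phi: R^D -> R^d
        self.prototypes = nn.Parameter(torch.randn(num_classes, feature_dim))  # p_i

    def forward(self, x):
        z = self.feature_extractor(x)                            # (batch, d)
        dists = torch.cdist(z, self.prototypes)                  # Euclidean distances d_i
        return F.log_softmax(-dists, dim=1)                      # Eq. (1): softmax over -d_i

def training_step(model, optimizer, x, y):
    """One update with the cross-entropy loss on the distance-based probabilities."""
    log_probs = model(x)
    loss = F.nll_loss(log_probs, y)                              # L = -log p(y | x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example setup (encoder and feature_dim are placeholders; optimizer choice assumed):
# model = PrototypeClassifier(encoder, num_classes=20, feature_dim=64)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)     # initial learning rate 0.001
```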

Supplementary note 8: Data collection process
Take the word "apple" as an example. The participant speaks "apple" 150 times, and the signals are recorded with a sampling rate of 500 Hz. The participant speaks the word 15 times per group, for 10 groups in total, and can rest between groups. To control the rhythm of speech, a counter (15 counts at 4 s intervals) is used to prompt the participant. The signals are then preprocessed for data recognition.
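As an illustration, the sketch below cuts one group recording into 15 utterance windows, under our own assumption that each utterance occupies one fixed 4 s window between counter ticks; the actual interception may instead follow the signal onset.

```python
import numpy as np

FS = 500           # sampling rate (Hz)
INTERVAL_S = 4     # counter interval (s)
REPS_PER_GROUP = 15

def split_group(recording):
    """Split one group recording into 15 fixed-length utterance windows."""
    window = FS * INTERVAL_S                                   # 2000 points per utterance
    segments = [recording[i * window:(i + 1) * window]
                for i in range(REPS_PER_GROUP)]
    return np.stack(segments)                                  # shape: (15, 2000)
```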

Supplementary note 9: Data preprocessing for machine learning
The data preprocessing for machine learning mainly consists of two steps. First, machine learning algorithms do not perform well when the features of the input samples have very different scales. Therefore, the first step of data preprocessing is standardization, a commonly used feature-scaling strategy. Specifically, standardization subtracts the mean value and divides by the standard deviation, so that the feature distribution has zero mean and unit variance. With standardization, the machine learning process is much less affected by differences in feature scale.
Second, to train and evaluate the model, the collected data are divided into two sets: a training set and a testing set. The 2000 samples in total are shuffled first; the first 1600 samples are used as the training set and the remaining 400 samples as the testing set.
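A minimal sketch of these two steps, assuming the 2000 samples are stored as rows of a NumPy array X with integer labels y; the fixed random seed is our own addition for reproducibility.

```python
import numpy as np

def standardize_and_split(X, y, train_size=1600, seed=0):
    """Standardize features, shuffle all samples, then split into 1600/400."""
    # Feature scaling: subtract the mean, divide by the standard deviation.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

    # Shuffle, then take the first 1600 samples as the training set.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    X, y = X[order], y[order]
    return (X[:train_size], y[:train_size]), (X[train_size:], y[train_size:])
```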

Supplementary note 10: the chosen words and phrases
Supplementary Fig. 7 Word list (20 fruit photographs, a-t) selected for collection and training.
